---
title: "Response surface design for bond strength & print visual quality optimization"
author: Laura Symul
date: today
format:
  html:
    code-fold: true
    toc: true
    toc-location: left
    number-sections: true
    embed-resources: true
editor: source
---

```{r}
#| include: false
library(tidyverse)
library(magrittr)
library(rsm)
library(scales)
library(pander)
library(plotly)
library(patchwork)
theme_set(theme_bw())
```

In this document, we plan the experiments and analyze the resulting data to optimize the bond strength and the print visual quality of seals produced by a machine that seals bags containing sterile material.

![](images/illustration%20Clamp.png)

# Central-composite design

We anticipate that both the bond strength and the print quality will be well described by a second-order polynomial model, and we have a budget of 20 trials for finding the optimum. Consequently, we could use a central composite design or a Box-Behnken design. Here, we will use a central composite design. As an exercise, you could repeat these experiments and analyses with a Box-Behnken design.
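For the suggested exercise, a Box-Behnken plan for the same three factors can be generated with `rsm::bbd`. This is only a sketch: the centers and half-ranges in the coding formulas below are derived by hand from the CCD ranges, and this design was not actually run.

```{r}
library(rsm)

# Box-Behnken design for 3 factors with 3 center points (illustrative only).
# Coding formulas use the standard (variable - center)/half-range form.
BBD <- bbd(
  3, n0 = 3, randomize = FALSE,
  coding = list(
    x1 ~ (Temperature - 150) / 30,
    x2 ~ (Time - 1.1) / 0.9,
    x3 ~ (Pressure - 100) / 50
  )
)
BBD  # 12 edge-midpoint runs + 3 center runs = 15 trials
```

Note that a Box-Behnken design has no corner or axial points, so it stays strictly inside the factor ranges.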

```{r}


x_formula <- "Temperature + Time + Pressure + Temperature:Time + Temperature:Pressure + Time:Pressure + I(Temperature^2) + I(Time^2) + I(Pressure^2)"

CCD <- rsm::ccd(
  basis = 3, n0 = c(3, 0), randomize = FALSE, alpha = "faces",
  coding = list(
    x1 ~ rescale(Temperature, from = c(120, 180), to = c(-1, 1)), 
    x2 ~ rescale(Time, from = c(0.2, 2), to = c(-1, 1)),
    x3 ~ rescale(Pressure, from = c(50, 150), to = c(-1, 1))
  )
)

```

The design is:

```{r}
CCD
```

Note: I chose to do 3 repetitions at the center point, so the total number of experiments comes to 17. Remember that I had a budget of 20 experiments, so why not do more repetitions to better estimate the experimental variability?

Well, I could have done that, but I wanted to keep some experiments for validation. Once we have the 17 initial data points from the CCD, we can fit our model, identify the best combination of factor values, then use the remaining 3 experiments to collect additional data at that point to confirm that the obtained values are indeed optimal.

I used the plan generated by the `ccd` function to run the experiments on the [simulator](https://shiny-oer.sipr.ucl.ac.be/doe/).

Note that I did not randomize the order of the experiments. This is not a problem here because we're using a simulator and we can assume that the simulator is stable over time. In a real experiment, you should randomize the run order to avoid bias from conditions drifting over time.

Once I had used the simulator to "perform the experiments", I downloaded the results from the simulator's "Report" tab in the "decimal: dot" format. The results are stored in the file `data/CCD_17.csv`. Let's load the data and analyze it.

# Data analysis

```{r}

ccd_df <- read_delim("data/CCD_17.csv", delim = ";")

ccd_df |> pander()

```

## Data visualization

We first do some visual exploration of the data.

```{r}

bond_colors <- colorRampPalette(colors = c("red4","tomato", "steelblue1", "black", "black"))(85*2)
print_colors <- colorRampPalette(colors = c("red4", "tomato", "steelblue1"))(50)

j_bond <- seq(ccd_df$Bond |> min() |> floor(), ccd_df$Bond |> max() |> ceiling())
j_print <- seq((10 * ccd_df$Print) |> min() |> floor(), (10 * ccd_df$Print) |> max() |> ceiling())

ccd_df |>  
  plot_ly(
    type = "scatter3d", mode = "markers",
    x = ~Temperature, y = ~Time, z = ~Pressure,
    color = ~Bond, colors = bond_colors[j_bond],
    text = ~paste0("Bond: ", Bond,"\nPrint: ", Print)
  ) |> 
  add_text(text = ~Trial_number, color = I("black"), showlegend = F) |> 
  layout(
    scene = 
      list(
        xaxis = list(range = c(120, 180)), 
        yaxis = list(range = c(0.2, 2)), 
        zaxis = list(range = c(50, 150), tickvals = seq(50, 150, by = 25))
      )
  )

ccd_df |>  
  plot_ly(
    type = "scatter3d", mode = "markers",
    x = ~Temperature, y = ~Time, z = ~Pressure,
    color = ~Print, colors = print_colors[j_print],
    text = ~paste0("Bond: ", Bond,"\nPrint: ", Print)
  ) |> 
  add_text(text = ~Trial_number, color = I("black"), showlegend = F) |> 
  layout(
    scene = 
      list(
        xaxis = list(range = c(120, 180)), 
        yaxis = list(range = c(0.2, 2)), 
        zaxis = list(range = c(50, 150), tickvals = seq(50, 150, by = 25))
      )
  )

```

## Response = Bond

We start the statistical analysis of the data by modeling the bond strength.

```{r}

bond_formula <- "Bond ~ " |> str_c(x_formula)

mod_bond <- lm(bond_formula, data = ccd_df)
mod_bond |> summary() 

# model.matrix(mod_bond)

```

We observe that the model is significant (very small p-value) and explains a lot of the variance in the data (high R-squared). Most parameters are statistically different from 0, except those involving the Pressure. From this, we conclude that the Pressure is not relevant for the strength of the bond.
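As a cross-check not in the original analysis, we can make this conclusion formal with a partial F-test, comparing the full model against a reduced model with all Pressure terms dropped:

```{r}
# Reduced bond model without any Pressure terms (illustrative cross-check).
mod_bond_red <- lm(
  Bond ~ Temperature + Time + Temperature:Time + I(Temperature^2) + I(Time^2),
  data = ccd_df
)
# A large p-value here supports dropping Pressure from the bond model.
anova(mod_bond_red, mod_bond)
```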

### Goodness of fit / Lack of fit

Let's further check the quality of our model.

```{r}

tibble(
  actual_bond = ccd_df$Bond,
  predicted_bond = predict(mod_bond)
) |> 
  ggplot(aes(x = actual_bond, y = predicted_bond)) +
  geom_abline(slope = 1, intercept = 0, color = "black") +
  geom_smooth(col = "black", method = "lm", formula = 'y ~ x') +
  geom_point(col = "steelblue1") +
  xlab("Observed bond") + ylab("Predicted bond")

```

Predicted values are very close to observed values, which is good.

```{r}
  
center_repetitions <- 
  ccd_df |> 
  group_by(Temperature, Time, Pressure) |> 
  mutate(mean_bond = mean(Bond), residuals = Bond - mean_bond, n = n()) |> 
  filter(n > 1) |> ungroup()

tibble(
  residuals = mod_bond$residuals,
  predicted_bond = predict(mod_bond)
) |> 
  ggplot(aes(x = predicted_bond, y = residuals)) +
  geom_hline(yintercept = 0, color = "black") +
  geom_hline(data = center_repetitions, aes(yintercept = residuals), col = "tomato", linetype = 2) +
  geom_smooth(col = "black", method = "lm", formula = 'y ~ x') +
  geom_point(col = "steelblue1") +
  xlab("Predicted bond") + ylab("Residuals")

```

We see that the residuals are relatively well distributed around 0, which is good. We also see that they are relatively small compared to the residuals at the center point (i.e., the difference between each observed value at the center point and the mean of those values), shown by the horizontal red lines.

From this, it does not look like there is lack of fit, but let's formally test for it.

```{r}

pure_error <- 
  ccd_df |> 
  group_by(Temperature, Time, Pressure) |> 
  mutate(
    mean_bond = mean(Bond),
    residuals = Bond - mean_bond
  ) |> 
  summarise(n = n(), SSE = sum(residuals^2), df = n-1, .groups = "drop") |> 
  filter(n > 1) |> 
  mutate(source = "pure error") |> 
  select(source, df, SSE)

lack_of_fit <- 
  tibble(
    source = "lack of fit",
    df = nrow(ccd_df) - length(mod_bond$coefficients) - pure_error$df,
    SSE = sum(mod_bond$residuals^2) - pure_error$SSE
    )

model_residuals <- 
  tibble(
    source = "model residuals",
    SSE = sum(mod_bond$residuals^2),
    df = mod_bond$df.residual
  )

bind_rows(pure_error, lack_of_fit, model_residuals) |> 
  mutate(MSE = SSE/df) |> 
  mutate(
    F_stat = ifelse(source == "lack of fit", MSE/lag(MSE), NA),
    # numerator df = lack-of-fit df (df[2]), denominator df = pure-error df (df[1])
    p_value = pf(F_stat, df[2], df[1], lower.tail = FALSE)
    ) |> pander()

```
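The same lack-of-fit F-test can be cross-checked by comparing our polynomial model with a "saturated" model that fits one mean per distinct factor combination, whose residuals are exactly the pure error; `anova()` then returns the lack-of-fit test directly. This is a sketch of an equivalent computation, not part of the original analysis:

```{r}
# Saturated model: one mean per unique (Temperature, Time, Pressure) setting.
mod_bond_pe <- lm(
  Bond ~ interaction(Temperature, Time, Pressure),
  data = ccd_df
)
# The F-test in this table is the lack-of-fit test (same df as above).
anova(mod_bond, mod_bond_pe)
```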

### Response surface

Since we saw that the pressure was not very relevant for the bond strength, we can predict the bond over a range of Temperature and Time values, keeping the Pressure constant at the center point, and visualize how the bond strength varies with these two factors.

```{r}

newdata <- 
  expand_grid(
  Temperature = seq(min(ccd_df$Temperature), max(ccd_df$Temperature), length.out = 60),
  Time = seq(min(ccd_df$Time), max(ccd_df$Time), length.out = 100),
  Pressure = 100
) 

newdata <- 
  newdata |> 
  mutate(
    bond = predict(mod_bond, newdata = newdata)
  )



newdata |> 
  ggplot(aes(x = Temperature, y = Time)) +
  geom_tile(aes(fill = bond)) +
  scale_fill_gradient2(low = "tomato", mid = "steelblue1", high = "black", midpoint = 85) +
  geom_contour(aes(z = bond), color = "white", binwidth = 10)
  
newdata |> 
  ggplot(aes(x = Temperature, y = Time)) +
  geom_contour_filled(aes(z = bond), binwidth = 5)


bond_wide <- 
  newdata |> 
  select(Temperature, Time, bond) |> 
  pivot_wider(names_from = Time, values_from = bond) |>
  select(-Temperature) |> 
  as.matrix()

# plotly surfaces expect z rows to match y and columns to match x, hence t()
plot_ly(x = unique(newdata$Temperature), y = unique(newdata$Time), z = t(bond_wide)) |> 
  add_surface() |> 
  layout(
    scene = 
      list(
        xaxis = list(title = "Temperature"),
        yaxis = list(title = "Time"),
        zaxis = list(title = "Bond")
      )
  )

```

From these plots, we see that there is a range of coupled Temperature and Time values that gives us the optimal bond.
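Instead of reading the optimum off the plots, the `rsm` package can locate the stationary point analytically via canonical analysis. A sketch, where the coding centers and half-ranges are re-entered by hand from our design ranges:

```{r}
library(rsm)

# Code the factors and refit the full second-order model with rsm().
coded_df <- coded.data(
  ccd_df,
  x1 ~ (Temperature - 150) / 30,
  x2 ~ (Time - 1.1) / 0.9,
  x3 ~ (Pressure - 100) / 50
)
mod_bond_rsm <- rsm(Bond ~ SO(x1, x2, x3), data = coded_df)
# The summary reports the stationary point (in coded and original units) and the
# eigenvalues, which indicate whether it is a maximum, a minimum, or a saddle.
summary(mod_bond_rsm)
```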

Let's now repeat these analyses but with the response variable being the print.

## Response = Print

```{r}

print_formula <- "Print ~ " |> str_c(x_formula)

mod_print <- lm(print_formula, data = ccd_df)
mod_print |> summary() 

```

Here too, we observe that the model is significant (very small p-value) and explains a lot of the variance in the data (high R-squared). Most parameters are statistically different from 0, except those involving the Pressure and the interactions. From this, we conclude that the Pressure is also not relevant for the visual quality of the print - lucky us!

### Goodness of fit / Lack of fit

Let's further check the quality of our model.

```{r}

tibble(
  actual_print = ccd_df$Print,
  predicted_print = predict(mod_print)
) |> 
  ggplot(aes(x = actual_print, y = predicted_print)) +
  geom_abline(slope = 1, intercept = 0, color = "black") +
  geom_smooth(col = "black", method = "lm", formula = 'y ~ x') +
  geom_point(col = "steelblue1") +
  xlab("Observed print") + ylab("Predicted print")

```

Predicted values are very close to observed values, which is good.

```{r}
  
center_repetitions <- 
  ccd_df |> 
  group_by(Temperature, Time, Pressure) |> 
  mutate(mean_print = mean(Print), residuals = Print - mean_print, n = n()) |> 
  filter(n > 1) |> ungroup()

tibble(
  residuals = mod_print$residuals,
  predicted_print = predict(mod_print)
) |> 
  ggplot(aes(x = predicted_print, y = residuals)) +
  geom_hline(yintercept = 0, color = "black") +
  geom_hline(data = center_repetitions, aes(yintercept = residuals), col = "tomato", linetype = 2) +
  geom_smooth(col = "black", method = "lm", formula = 'y ~ x') +
  geom_point(col = "steelblue1") +
  xlab("Predicted print") + ylab("Residuals")

```

We see that the residuals are relatively well distributed around 0, which is good. We also see that they are in the same range as the residuals at the center point (i.e., the difference between each observed value at the center point and the mean of those values), shown by the horizontal red lines.

From this, it does not look like there is lack of fit, but let's formally test for it.

```{r}

pure_error <- 
  ccd_df |> 
  group_by(Temperature, Time, Pressure) |> 
  mutate(
    mean_print = mean(Print),
    residuals = Print - mean_print
  ) |> 
  summarise(n = n(), SSE = sum(residuals^2), df = n-1, .groups = "drop") |> 
  filter(n > 1) |> 
  mutate(source = "pure error") |> 
  select(source, df, SSE)

lack_of_fit <- 
  tibble(
    source = "lack of fit",
    df = nrow(ccd_df) - length(mod_print$coefficients) - pure_error$df,
    SSE = sum(mod_print$residuals^2) - pure_error$SSE
    )

model_residuals <- 
  tibble(
    source = "model residuals",
    SSE = sum(mod_print$residuals^2),
    df = mod_print$df.residual
  )

bind_rows(pure_error, lack_of_fit, model_residuals) |> 
  mutate(MSE = SSE/df) |> 
  mutate(
    F_stat = ifelse(source == "lack of fit", MSE/lag(MSE), NA),
    # numerator df = lack-of-fit df (df[2]), denominator df = pure-error df (df[1])
    p_value = pf(F_stat, df[2], df[1], lower.tail = FALSE)
    ) |> pander()

```

We do not see a significant lack of fit here.

### Response surface

Since we saw that the pressure was not very relevant for the print quality either, we can do the same as we did for the bond: predict the print quality over a range of Temperature and Time values, keeping the Pressure constant at the center point, and visualize how it varies with these two factors.

Remember that for the print, the higher the better.

```{r}

newdata <- 
  newdata |> 
  mutate(
    print = predict(mod_print, newdata = newdata)
  )

newdata |> 
  ggplot(aes(x = Temperature, y = Time)) +
  geom_tile(aes(fill = print)) +
  scale_fill_gradient(low = "tomato", high = "steelblue1") +
  geom_contour(aes(z = print), color = "white", binwidth = 0.25)
  
newdata |> 
  ggplot(aes(x = Temperature, y = Time)) +
  geom_contour_filled(aes(z = print), binwidth = 0.25)


print_wide <- 
  newdata |> 
  select(-Pressure, -bond) |> 
  pivot_wider(names_from = Time, values_from = print) |>
  select(-Temperature) |> 
  as.matrix()

# plotly surfaces expect z rows to match y and columns to match x, hence t()
plot_ly(x = unique(newdata$Temperature), y = unique(newdata$Time), z = t(print_wide)) |> 
  add_surface() |> 
  layout(
    scene = 
      list(
        xaxis = list(title = "Temperature"),
        yaxis = list(title = "Time"),
        zaxis = list(title = "Print quality")
      )
  )

```

From these plots, we see that the print quality is maximized at a single point, obtained for the following factor values:

```{r}

newdata |> filter(print == max(print)) |> pander()

```

## Combining results

We can now combine the results and see what are the optimal values for the factors that give us the best bond and print quality.

```{r}

g <- 
  newdata |> 
  ggplot(aes(x = Temperature, y = Time)) +
  geom_contour_filled(aes(z = bond), binwidth = 5) +
  geom_contour(aes(z = print), color = "white", binwidth = 0.1) +
  geom_point(data = newdata |> filter(print == max(print)), col = "white", size = 3, shape = 4)

g

```

From the plot above, we see that the optimum for the print conflicts with the acceptable range for the bond. We will thus pick the values that keep the bond within the acceptable range and maximize the print quality.

We define an acceptable range for the bond as $85 \pm \delta$.

This is achieved with the following values:

```{r}

delta <- 0.5

optimum <- 
  newdata |> 
  filter(bond >= 85 - delta, bond <= 85 + delta) |> 
  filter(print == max(print))

bond_PI  <- predict(mod_bond,  newdata = optimum, interval = "prediction")
print_PI <- predict(mod_print, newdata = optimum, interval = "prediction")

optimum <- 
  optimum |> 
  mutate(
    bond_PI_low   = bond_PI[, "lwr"],
    bond_PI_high  = bond_PI[, "upr"],
    print_PI_low  = print_PI[, "lwr"],
    print_PI_high = print_PI[, "upr"]
  ) |> 
  select(Temperature, Time, Pressure, starts_with("bond"), starts_with("print"))

optimum |> pander()

g +
  geom_hline(yintercept = optimum$Time, linetype = 2) +
  geom_vline(xintercept = optimum$Temperature, linetype = 2) +
  geom_point(data = optimum, size = 3)

```

Let's now repeat our experiment at this optimum to see if we indeed obtain good values.

# Validations

```{r}

val_df <- read_delim("data/Validation_3 .csv", delim = ";")

```

```{r}

df <- 
  bind_rows(
    ccd_df |> mutate(type = "experiment"),
    val_df |> mutate(type = "validation")
  ) |> 
  mutate(trial = row_number() |> factor())

g_bond <- 
  df |> 
  ggplot(aes(x = trial, y = Bond, col = type)) +
  geom_hline(yintercept = 85, linetype = 2) +
  geom_hline(yintercept = c(optimum$bond_PI_low, optimum$bond_PI_high), linetype = 3) +
  geom_point(size = 2)

g_print <- 
  df |> 
  ggplot(aes(x = trial, y = Print, col = type)) +
  geom_hline(yintercept = c(optimum$print_PI_low, optimum$print_PI_high), linetype = 3) +
  geom_point(size = 2)

g_bond + g_print + plot_layout(guides = "collect")
  
  

```

Did we fail?

```{r}

t.test(val_df$Bond, mu = 85, alternative = "two.sided", conf.level = 0.95) |> pander()

```

We didn't :)

(But even if the p-value had been smaller, this would not have meant that we failed; it would only mean that the variability is larger than our "delta".)
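In that spirit, an equivalence test (two one-sided t-tests, TOST) against the acceptance window $85 \pm \delta$ is the more direct way to phrase the question. A sketch, where the $\delta$ and the significance level are illustrative choices:

```{r}
delta <- 0.5
alpha <- 0.05

# TOST: the mean bond is declared "equivalent" to 85 (within +/- delta)
# only if BOTH one-sided tests reject at level alpha.
t_low  <- t.test(val_df$Bond, mu = 85 - delta, alternative = "greater")
t_high <- t.test(val_df$Bond, mu = 85 + delta, alternative = "less")
tost_p <- max(t_low$p.value, t_high$p.value)
tost_p < alpha  # TRUE would indicate equivalence within the window
```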
